According to IBGE, the majority of Brazilians are of mixed race. Racial and genetic admixture impact demographic and health information, and it is in the public interest that genetic data be used to improve the health system of Brazil (SUS), in the context of omics databases, to construct machine learning and artificial intelligence tools. The term pardo in current Brazilian Portuguese refers to individuals of mixed ancestry, but it was originally used by Pero Vaz de Caminha in 1500 to describe the skin color of Native Brazilians at the time of contact with Portuguese sailors. Integration of genomic and ancestry information of Brazilians into public health data, such as the data provided by the Information Technology Department of SUS (DATASUS), can help estimate the genetic risk of diseases and propose policies to improve diagnosis, allocation of resources, services, and therapies for population groups at higher genetic risk. To make better use of genomics in these processes, it is necessary to discuss the use of the term pardo as a biomarker of racial and color identification in Brazil to classify individuals of admixed ancestry. Here we introduce GERALDA, a computational framework in the R language that uses mitochondrial variants to estimate the genetic risk of neuroblastoma and neurodegenerative diseases in Brazilians. This work shows an increased genetic risk of disease and its impact on cognition, morbidity, and mortality of Brazilians using mitochondrial DNA variants. This information can be used in the organization of the public health system, contributing to the rational use of its resources.
Our group identified hypoxia as a triggering signal for the cellular transition from adrenergic (ADRN) to mesenchymal (MES) cells in neuroblastoma. We found that the transition is mediated by the deposition of the epigenetic marker 5-hydroxymethyl-cytosine (5-hmC) via the ten eleven translocation enzyme (TET1), a functional mechanism of hypomethylation. We are investigating hypomethylation patterns via 5-hmC in immortalized cells, tumor samples and cell-free DNA (cfDNA) isolated from peripheral blood liquid biopsies of neuroblastoma patients. Repetitive genomic elements, or repetitive regions of the genome, are part of the non-coding region and are important for the maintenance of the pluripotent cellular state of stem cells, which we call the de-differentiated state. Mesenchymal stem cells (MES) are dedifferentiated and their gene expression pattern shows a high correlation with the stem and pluripotent cell state.
We evaluated the deposition of the epigenetic marker 5-hmC in repetitive regions of the genomes of cells, tumors, and cfDNA of neuroblastoma patients for the oncological management of patients. We aim to incorporate liquid biopsy with sequencing of 5-hmC in repetitive element markers into the routine of public precision medicine programs for neuroblastoma in Brazil. To this end, we propose the construction of a computational database that incorporates the genomic and demographic data available in the health system, to allow better data analysis, machine learning, and artificial intelligence using omics marker data for the management of nervous system diseases in the public health system, together with the Center for Data Integration and Knowledge for Health (CIDACS) of the Gonçalo Moniz Institute at Fiocruz in Bahia.
Neuroblastoma is a pediatric cancer of the peripheral nervous system. Tumors are composed of two main cell lineages: adrenergic (ADRN) and mesenchymal (MES) cells. These cells can interconvert using enhancers and super-enhancers (van Groningen et al. 2017; Boeva et al. 2017), epigenetic markers in noncoding genome regions that were identified using machine learning by hierarchical clustering (van Groningen et al. 2017) and principal component analysis (Boeva et al. 2017). Despite the importance of the mechanism of ADRN-to-MES cell interconversion, the cellular signals that trigger the transition are not understood. ADRN cells are neuron-like and sensitive to chemotherapy (Figure 1, left), while MES cells are similar to undifferentiated or dedifferentiated stem cells (Figure 1, right) and are responsible for resistance to chemotherapy and immunotherapy (van Groningen et al. 2017; Kendsersky et al. 2022; Mabe et al. 2022).
Hypoxia is the condition of limited oxygen supply for tumor growth. Among extracellular signals that activate 5-hmC deposition via TET1, our group at the University of Chicago identified hypoxia as a signal that activates gene expression by 5-hmC deposition and methylation removal (Mariani et al., 2014). Our studies and those of other groups have shown that hypoxia drives dedifferentiation in neuroblastoma cells (Jögi et al. 2002; Mariani et al. 2014; Hains et al. 2022). We identified a functional epigenetic demethylation mechanism mediated by Ten Eleven Translocation (TET) family enzymes, and genes activated by hypoxia in the transition from the ADRN to MES cellular state (Chaves et al. 2024 - in preparation), (Figure 5). Our group’s results propose a functional mechanism of hypoxia-driven methylation removal and gene expression activation (Mariani et al., 2014; Hains et al., 2022; Chaves et al. 2024 - in preparation) (Figures 2 and 5). Our work suggests hypoxia as a mechanism for maintaining cellular dedifferentiation, consistent with the hypomethylation of repetitive transposable elements (Figure 5).
Most CpG islands are located in repetitive intergenic regions of the genome (Lanciano and Cristofari 2020) and, when methylated, serve to maintain an inactive state of DNA transcription (Fetahu and Taschner-Mandl 2021). Repetitive transposable DNA elements (TEs) are mobile genetic elements that make up a large fraction of the genome, reaching 15% in C. elegans and 85% in maize (Lanciano and Cristofari, 2020). These percentages show the potential of these elements as biomarkers (Lanciano and Cristofari, 2020). TEs have been considered “junk DNA” (Burns 2017; Ma et al. 2022) and their value in epigenetic modulation in neuroblastoma and as biomarkers is undetermined.
TEs are active in human embryonic stem cells and, through functional hypomethylation, participate in the activation of gene expression in a mechanism compatible with TET1 activity and 5-hmC deposition (Ma et al., 2022). Retrotransposons such as HERV-H demarcate topologically associated domains (TADs) in chromatin, which maintain the cellular pluripotency state (Zhang et al., 2019). These observations agree with the hypothesis of induction of a stem cell (dedifferentiated) state in hypoxia-activated ADRN cells, involving repetitive or transposable elements, which lead to the expression of MES genes (Figure 5). Important biomarkers can be identified among TE elements, since they represent a high fraction of the genome composition. Thus, the detection of repetitive elements of the genome in cfDNA isolated by liquid biopsy can help in the management of cancer patients (Figure 2).
Studies from our group have identified 5-hmC deposition patterns (Figure 3A) in tumors (Applebaum et al., 2019) and liquid biopsy cfDNA (Applebaum et al., 2020) using the nano 5-hmC-seal method developed by Dr. Chuan He from the University of Chicago (Figure 3B). We propose that the TET1 enzyme, activated by hypoxia, causes oxidation of methylated DNA sequences and formation of 5-hydroxymethyl-cytosine (5-hmC) hypomethylation regions (Figure 3A). Thus, in the present project, sequencing the 5-hmC marker in Brazil will aid in patient management, as done in the USA (Chennakesavalu, Moore, Chaves, et al. 2024). We postulate that it will be possible to quantify ADRN and MES phenotypes in tumors in Brazilians, as we have recently done (Vayani, Chaves et al. 2023).
Mitochondria are important organelles in cellular metabolism and the citric acid cycle and play a role in pluripotency and differentiation at the embryonic stage (Carey et al. 2015; Hensley, Wasti, and DeBerardinis 2013; Qing et al. 2012; Yoo et al. 2020). Mitochondrial DNA is small and maternally derived, with approximately 16,000 base pairs. Mitochondria provide the cellular supply of ATP for the nervous system, playing an important role in diseases of this system, such as neuroblastoma and neurodegenerative diseases. In neuroblastoma, John Maris’ group conducted case-control studies in Caucasians from the USA and observed that mitochondrial haplogroups (genetic variants) are associated with a reduced risk of the disease (Chang et al. 2020). The same group observed that a mitochondrial single nucleotide polymorphism (SNP), rs2853493, is associated with the risk of neuroblastoma and impacts the expression of the mitochondrial cytochrome B gene, MT-CYB (Chang et al. 2022).
In neurodegenerative diseases, Tranah and collaborators observed that, in elderly Caucasians, different mitochondrial haplogroups present an increased risk of developing dementia and cognitive decline (Tranah et al. 2012). The group reported that in African-Americans, haplogroup L1 represents a greater risk of developing dementia when compared to haplogroup L3, which is more common in this ethnic-racial group. Also for haplogroup L3, the SNP p.V193I, a substitution in the ND2 gene, was associated with increased levels of amyloid plaque, a phenotype of Alzheimer’s disease (Tranah et al. 2014). Neuroblastoma arises during sympathetic neurogenesis. Neurogenesis and nervous system development pathways are suppressed in mitochondrial haplogroups in neuroblastoma, and this may be related to the mechanism underlying the reduced risk associated with the mitochondrial haplogroups investigated in Chang et al. (2020). Given the importance of mitochondrial variation for dementia and neuroblastoma, computational methods need to be developed to quantify the genetic risk associated with mitochondrial sequences, especially in populations other than US Whites, for which data are scarce. The choice of mitochondrial sequences considers an existing cross-talk between mitochondrial metabolism and the activation of repetitive genomic elements (Baeken, Moosmann, and Hajieva 2020; Larsen et al. 2017; Stoccoro and Coppedè 2021; Lopes 2020; Bravo et al. 2020), which has implications for the role of mitochondrial haplogroups in the activation of the immune system, as discussed by Chang et al. (2020).
To allow demographic data on nervous system diseases to be contrasted in the context of the public health system of Brazil, we propose exploring mortality data on neuroblastoma and neurodegenerative diseases in the public database of the Informatics Department of the Ministry of Health of Brazil (DATASUS). This will allow the development of computational tools capable of estimating genetic risk for different Brazilian racial groups, a long-needed approach for the health system of Brazil, based on the identification of mitochondrial variants that confer a higher risk of neuroblastoma and neurodegenerative diseases, using data curated from the literature (Tranah et al. 2012; Tranah et al. 2014; Chang et al. 2020; Chang et al. 2022).
Management and decision-making processes in health systems occur with development of tools capable of processing and analyzing data generated by such systems, investigating the dynamics that affect mortality in various morbidities, such as those that affect the nervous system, which represent a significant proportion of Government expenditures. Recently, private interest of pharmaceutical companies leveraged the power of the biobank of the United Kingdom to develop machine learning and artificial intelligence tools to predict diseases and phenotypes using the information contained in the UK Biobank, as presented in the study carried out by Garg et al. (2024). In Brazil, the Federal Constitution of 1988 established the Unified Health System (in Portuguese, Sistema Único de Saúde - SUS), and after that, the SUS Department of Informatics (DATASUS) was created to organize the data collected by SUS (Saldanha, Rocha Bastos, and Barcellos, 2019). More recently, Programa Genomas Brasil was implemented with the goal to sequence 100,000 nationals to inform precision medicine policies in the public health system. Our group verified the persistence of racial inequality in the risk and survival of neuroblastoma (Chennakesavalu et al. 2023). Due to the genetic admixture present in Brazil since the beginning of colonization, a considerable part of the Brazilian population is of mixed race or black, which raises questions about the incidence, risk and treatment of neuroblastoma patients in Brazil considering their racial identification. Brazil is marked by profound socioeconomic inequalities related to the ethnic-racial origin of the population, which include but are not limited to digital literacy (Araújo da Silva and Behar 2019), the use of programming languages in genomic sciences (Sano et al. 2024; Vera-Choqqueccota et al. 2024), and racial inequality in health. 
The latter was recently identified by the Longitudinal Study of Adult Health (ELSA-Brazil), a study conducted with public workers in Brazilian public institutions (ELSA Brazil 2023). The creation of SUS-linked databases to conduct genomic research must therefore take into account the impact of Brazilian genetic admixture on both access to digital literacy and the health of Brazilians themselves. Considering the requirement of public resources for genomic studies and for the genomic literacy of researchers and scientists in Brazil, we propose the GERALDA framework as a concept to guide the integration of genomic information into the demographic database of SUS, as presented in the Materials and Methods section.
The R package Microdatasus was used to access mortality data on neuroblastoma and neurodegenerative diseases as described by Freitas Saldanha et al. (2019).
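The selection of mortality records can be sketched as follows. In practice the death records of the Mortality Information System (SIM) would be downloaded and labelled with `microdatasus::fetch_datasus()` and `microdatasus::process_sim()`; to stay self-contained, this sketch uses a small mock data frame instead. The column names `CAUSABAS` (underlying cause of death, ICD-10) and `RACACOR` (race/color) follow the SIM layout, whereas the records themselves, the helper name `select_causes`, and the choice of ICD-10 prefixes (G30 for Alzheimer's disease, C74 for adrenal gland tumors as a rough proxy for neuroblastoma) are illustrative assumptions, not the selection used in this work.

```r
# Mock of a processed SIM table (hypothetical records, not real data).
sim <- data.frame(
  CAUSABAS = c("G300", "C749", "I219", "G309", "C740", "J189"),
  RACACOR  = c("Branca", "Parda", "Preta", "Parda", "Branca", "Parda"),
  stringsAsFactors = FALSE
)

# Keep the records whose underlying cause starts with any of the ICD-10 prefixes.
select_causes <- function(df, prefixes) {
  pattern <- paste0("^(", paste(prefixes, collapse = "|"), ")")
  df[grepl(pattern, df$CAUSABAS), , drop = FALSE]
}

# Assumed prefixes: G30 (Alzheimer's disease) and C74 (adrenal gland tumors).
nervous <- select_causes(sim, c("G30", "C74"))

# Number of deaths per race/color category among the selected causes.
deaths_by_race <- table(nervous$RACACOR)
print(deaths_by_race)
```

The same filter could be applied unchanged to a real table returned by `process_sim()`, since it only relies on the `CAUSABAS` and `RACACOR` columns.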
Mitochondrial haplogroups were classified using Haplogrep3 as described in Schönherr et al. (2023). This classification algorithm generates a csv file that can be used with other R packages to understand the genetic risk of neuroblastoma and neurodegenerative diseases associated with each haplogroup.
Risk was estimated and plotted using geobr as described by Pereira and Goncalves (2024). Geobr is a computational package for downloading official spatial data sets of Brazil. The package includes a wide range of geospatial data in geopackage format (like shapefiles), available at various geographic scales and for various years, with harmonized attributes, projection and topology. This allows us to achieve a spatial-geographic organization of the data provided by the DATASUS department of the Ministry of Health.
The adoption of genomic information and epigenetic markers, such as the 5-hmC marker in neuroblastoma, into the SUS system involves challenges, which necessarily include activities to support digital literacy and the use of computer programming technologies in genomics throughout the country. Thus, this project includes, in the methodological part, collaborative genomic research between CIDACS in Brazil and my doctoral and postdoctoral institutions in the USA, to mediate the teaching of computer programming languages for genomic research and the construction of the database for the application of machine learning to demographic data from the SUS. CIDACS proposed the harmonization of databases of social and health indicators, creating the Cohort of 100 million Brazilians and making important contributions to national health and epidemiology (Barreto et al. 2021). Integrating genomic information databases into the pipelines that allow investigation of social and demographic data in Brazil can enrich and improve the public policies of the national public health system. It also has the potential to generate protocols for the use of machine learning and artificial intelligence in disease classification in the health system.
Figure 1: Framework of the GERALDA pipeline. Starting with fasta or fastq sequences, samples are aligned to the reference genome. Once VCF files are produced out of each sample, custom scripts are used to extract the genotypes of interest that will be used as features to inform the machine learning algorithm to classify discrete categories. Each haplotype thus identified is then used to label a racial group in the Microdatasus dataframe. This information is then used to estimate the genetic risk of each racial group in the DATASUS dataframe.
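The genotype-extraction step of the pipeline can be illustrated with a minimal base R sketch. It reads the body of a single-sample VCF and encodes the presence of selected mitochondrial variants as 0/1 features. The VCF lines, the sample name `sample1`, the list of positions, and the helper `extract_features` are hypothetical placeholders chosen for illustration; they are not the custom scripts or the curated risk variants of this work.

```r
# Minimal single-sample VCF (hypothetical calls on the mitochondrial contig).
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
  "chrM\t152\t.\tT\tC\t.\tPASS\t.\tGT\t1",
  "chrM\t16223\t.\tC\tT\t.\tPASS\t.\tGT\t1",
  "chrM\t16362\t.\tT\tC\t.\tPASS\t.\tGT\t0"
)

# Drop the "##" meta lines; keep the "#CHROM" header line and the records.
body <- vcf_lines[!grepl("^##", vcf_lines)]
vcf <- read.delim(text = sub("^#", "", body), colClasses = "character")
vcf$POS <- as.integer(vcf$POS)

# Encode each position of interest as 1 (alternate allele called) or 0.
# Positions absent from the VCF are treated as reference (0).
extract_features <- function(vcf, positions, sample) {
  hits <- vcf[match(positions, vcf$POS), sample]
  features <- ifelse(is.na(hits), 0L, as.integer(hits != "0"))
  names(features) <- paste0("m", positions)
  features
}

features <- extract_features(vcf, c(152L, 16223L, 16362L, 16519L), "sample1")
print(features)
```

A vector of this kind, one per sample, is the sort of feature row that a downstream classifier could consume; treating missing positions as reference is a simplifying assumption that would need revisiting for low-coverage data.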
Historical genealogy records can be used to understand the ancestry of individuals. In Brazil, several family websites maintain records that can be used to retrieve ancestry information about Brazilian nationals. Using an Excel-compatible file, an R script can be used to access the ancestry information of Brazilian families, as in this code chunk, where we explore information about the descendants of Diogo Álvares Correia Caramuru and Catarina Paraguassu, who lived in Cidade do Salvador da Bahia, the first capital of Brazil, during the 1500s:
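```r
library("ggenealogy")  # dfToIG(), plotAncDes()
library("dplyr")
library("readxl")      # read_excel()

# Load the genealogy table from the Excel workbook.
brGeneal <- read_excel("../../ReComBio Scientific/geraldo/data/brGeneal.xlsx",
                       sheet = "Sheet9")
# Convert the genealogy table into a graph object.
brIG <- dfToIG(brGeneal)
# Plot up to 3 generations of ancestors and 7 of descendants of the individual.
plotAncDes("Francisco Dias d'Ávila", brGeneal, mAnc = 3, mDes = 7, vCol = "blue")
```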
Figure 2: Genealogical record of Brazilian families from the state of Bahia. Highlighted is Francisco Dias D’Ávila, great-grandchild of Diogo Álvares Correia, known in Brazil as Caramuru, and Catarina Paraguassu, a Native Brazilian tupinambá who was baptized Catholic. Caramuru and Paraguassu are considered the first Portuguese-Native couple married in Brazil by the Catholic Church. Estimates of the number of descendants of their marriage reach 50 million Brazilians. Francisco Dias D’Ávila gives his name to the city of Dias D’Ávila, located in the State of Bahia, showing the role of miscegenation in the formation of the national identity, genetics and genomics of Brazil.
Race is no longer recognized by science as a valid system for classifying human groups, although the history of scientific racism testifies to the involvement of science in racist ideologies. Descendants of the Dias D’Ávila family joined the efforts of Luso-Brazilians to expel the Dutch during the Dutch invasion of Brazil. Francisco Dias D’Ávila was the son, grandson and great-grandson of Portuguese men (Diogo Álvares Correia Caramuru, Vicente Dias de Beja and Diogo Dias de Beja, Figure 2), illustrating the phenomenon of sexual asymmetry in the racial identity of the genomes of Brazilians. Francisco Dias D’Ávila probably had enough European ancestry to be a Southern European-looking white man who carried the mitochondrial ancestry of Isabel de Ávila, whom genealogical records indicate to be a daughter of Tomé de Souza and the Native Brazilian Francisca Rodrigues (Figure 2). Carrying a Native Brazilian mitochondrial sequence and a patrilineal Southern European white appearance, descendants of the Dias D’Ávila family migrated to Pernambuco, north of Bahia. While Pernambuco was occupied by the Dutch, the light-skinned Dias D’Ávila descendants, whose Southern European appearance clearly differed from the Northern European one, may have carried the Ashkenazi Jewish mitochondrial ancestry inherited from the Dutch, which we find in the mitochondrial sequences of the State of Pernambuco, to other locations in Northeastern Brazil (Table 1 and Figure 8).
Table 1: Identification of mitochondrial haplogroups in self-declared white Brazilians, identified using Haplogrep 3 (Schönherr, Weissensteiner, and Kronenberg 2023). Samples present haplogroups J and K. These genotypes were related to the risk of dementia in the 2012 Tranah study in individuals of European ancestry. In individuals of African ancestry, haplogroup L1 is identified, which presents an increased risk of developing dementia, according to Tranah 2014. Also according to the 2014 Tranah study, the most common haplogroup among people of African ancestry, haplogroup L3, which is also observed in this sample from Brazil with 4 individuals (4 counts), presents higher levels of amyloid plaque deposition. This suggests that these individuals represent a risk group for the development of dementia among Brazilians.
| Origin | Northeast | South | Southeast |
|---|---|---|---|
| African | 18 | | 30 |
| Amerindian/Asian | 7 | | 15 |
| European | 14 | 17 | |
| Total cases | 39 | 17 | 45 |
Of particular relevance are the roles of England and the United States in social Darwinism and in popularizing the phenotypes of (Northern) European groups in the histories of Brazil and the former Iberian colonies in America. Rambaran-Olm and Wade (2021) therefore remind us not to forget the historical record of scientific racism, social Darwinism and eugenic practices that European countries and their colonies carried out in the belief in a homogeneous “European or White Race”. In Brazil, this idea has made even racial self-classification a social and political matter. Although it is acknowledged that the country has a historical genetic admixture involving the Portuguese and other European, African and Native American populations, Chor and Araujo Lima (2005) report a historical struggle in Brazil with racial inequality in access to health. Racial classification systems have been difficult to implement in Brazil because of the known genetic admixture that started with the arrival of the first Portuguese. The difficulty of racial classification can be perceived in day-to-day life. The term “afroconveniência”, for example, which is difficult to translate into English, was created in Brazil to describe people considered too “light-skinned” to claim African ancestry, according to Silva et al. (2023). Among the systems proposed for racial classification are ancestry, self-identification and genealogical records maintained by the government for identification. In the USA, where the government has used racial identification in immigration, marriage and citizenship policies, the white identity was constructed around immigration from England and other Northern European countries. In Brazil, the construction of the white identity followed a different pattern, based not on genealogical records but on the European phenotypes of the Southern European colonization, primarily of Portuguese but also, importantly, of Italian and Spanish ancestry.
According to Chor and Araujo Lima (2005), IBGE adopted self-declaration for racial classification purposes.
Biological, genetic or genomic ancestries can inform the dynamics of ancient human population migration as well as the interaction of human populations with the environment (Tranah et al. 2012). The genomic ancestry investigated using mitochondrial sequences can also inform the genetic risk for diseases of the nervous system such as neurodegenerative diseases and neuroblastoma. Ethnic and religious groups such as Jews use mitochondrial sequences to estimate matrilineal ancestry (Feder et al. 2008), and we chose mitochondrial sequences because of the role of mitochondria in metabolic reprogramming and de-differentiation, which we identified in neuroblastoma to be associated with tumor progression (Chaves et al., 2024 - in preparation). To investigate the ancestry of the self-declared white population in Brazil, we used Haplogrep3 to classify the matrilineal lineage of self-declared white Brazilians, aiming to quantify genetic risks for variants known to affect the nervous system (Table 1).
Samples depicted in Table 1 derive from Haplogrep 3. They can be visualized as the number of sequences per continent of origin. The haplogroups.regions object is used to investigate the risk of disease in the nervous system per regions of Brazil, based on the genotypes of the mitochondrial DNA sequences for each of the large regions. To understand the risk of nervous system diseases we calculate the incidence of mitochondrial variants in the large regions. The code counts each of the haplogroups in Brazil.
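A minimal sketch of such a counting step is given here, assuming that `haplogroups.regions` is a data frame with the columns `SampleID`, `Genotype`, `Origin` and `Region`. Only eight of the classified samples are reproduced so that the block stays short; the object names `haplogroup_counts`, `origin_counts` and `distinct_per_region` are illustrative.

```r
# Subset of the classified samples (eight rows taken from the full table).
haplogroups.regions <- data.frame(
  SampleID = c("AF243627", "AF243635", "AF243637", "AF243652",
               "AF243666", "AF243696", "AF243780", "AF243781"),
  Genotype = c("A2", "L3", "L3", "H1", "L3", "A+", "H5", "H1"),
  Origin   = c("Amerindian/Asian", "African", "African", "European",
               "African", "Amerindian/Asian", "European", "European"),
  Region   = c("Northeast", "Northeast", "Northeast", "Northeast",
               "Southeast", "Southeast", "South", "South"),
  stringsAsFactors = FALSE
)

# Number of samples per haplogroup and region.
haplogroup_counts <- table(haplogroups.regions$Genotype,
                           haplogroups.regions$Region)

# Number of samples per continental origin and region (layout of Table 1).
origin_counts <- table(haplogroups.regions$Origin,
                       haplogroups.regions$Region)

# Number of distinct haplogroups observed in each region.
distinct_per_region <- colSums(haplogroup_counts > 0)

print(haplogroup_counts)
print(distinct_per_region)
```

Run on the full set of samples rather than this subset, the same calls would yield the per-origin and per-haplogroup tables by region.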
This approach allows calculation of the incidence rate of each haplogroup and of the frequency of the genetic variants in the Brazilian large regions. With n the number of samples assigned to a given haplogroup in a region and N the total number of samples in that region, the frequency is f = n / N.
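As a worked illustration of the ratio n / N, the sketch below applies it to the counts per origin and region reported in Table 1. The empty cells of the table are treated as zero, which is an assumption of this sketch, and the object names `table1_counts`, `N` and `freq` are illustrative.

```r
# Counts from Table 1 (rows: origin, columns: region); empty cells taken as 0.
table1_counts <- matrix(
  c(18,  7, 14,   # Northeast: African, Amerindian/Asian, European
     0,  0, 17,   # South
    30, 15,  0),  # Southeast
  nrow = 3,
  dimnames = list(Origin = c("African", "Amerindian/Asian", "European"),
                  Region = c("Northeast", "South", "Southeast"))
)

# N: total number of samples in each region.
N <- colSums(table1_counts)

# f = n / N, computed column by column.
freq <- sweep(table1_counts, 2, N, "/")

# Frequencies as percentages, rounded to one decimal place.
print(round(100 * freq, 1))
```

The same `sweep()` call works on a haplogroup-by-region count matrix, giving the frequency of each haplogroup within each region.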
In this visualization, we include a SNP (single nucleotide polymorphism) in the last column:
| SampleID | Genotype | Origin | Region | Found_Polys |
|---|---|---|---|---|
| AF243627 | A2 | Amerindian/Asian | Northeast | 152C! 16111T 16126C 16223T 16259T 16290T 16319A 16362C |
| AF243628 | G1 | Amerindian/Asian | Northeast | 16223T 16325C 16362C |
| AF243629 | B4 | Amerindian/Asian | Northeast | 16189C 16217C |
| AF243630 | B2 | Amerindian/Asian | Northeast | 16189C 16217C 16249C 16312G 16344T |
| AF243631 | A+ | Amerindian/Asian | Northeast | 16223T 16290T 16319A 16362C |
| AF243632 | C1 | Amerindian/Asian | Northeast | 16223T 16298C 16325C 16327T 16362C |
| AF243633 | M7 | Amerindian/Asian | Northeast | 16223T 16295T 16362C |
| AF243634 | L1 | African | Northeast | 1438G! 15301A! 16126C 16129A! 16187T 16189C 16223T 16264T 16270T 16278T 16293G 16311C |
| AF243635 | L3 | African | Northeast | 16176T 16223T 16327T |
| AF243636 | M4 | African | Northeast | 6131G! 16223T 16294T 16294T |
| AF243637 | L3 | African | Northeast | 16223T 16327T |
| AF243638 | L0 | African | Northeast | 73G! 146C! 182T! 195C! 263G! 15301A! 16129A 16148T 16168T 16172C 16187T 16188G 16189C 16223T 16230G 16278T! 16311C 16320T |
| AF243639 | L2 | African | Northeast | 150T! 182T! 16189C 16192T 16223T 16278T 16294T 16309G 16311C! |
| AF243640 | L3 | African | Northeast | 16124C 16223T |
| AF243641 | L3 | African | Northeast | 16185T 16223T 16327T |
| AF243642 | L2 | African | Northeast | 16223T 16264T 16278T 16311C! |
| AF243643 | L2 | African | Northeast | 150T! 182T! 16189C 16223T 16225T 16234T 16278T 16294T 16309G 16311C! |
| AF243644 | L1 | African | Northeast | 15301A! 16129A 16187T 16189C 16214T 16223T 16265C 16278T 16291T 16294T 16311C 16360T |
| AF243645 | L1 | African | Northeast | 15301A! 16129A 16187T 16189C 16223T 16265C 16278T 16286G 16294T 16311C 16360T |
| AF243646 | L2 | African | Northeast | 150T! 182T! 16223T 16278T 16294T 16309G 16311C! |
| AF243647 | L0 | African | Northeast | 73G! 146C! 182T! 195C! 263G! 15301A! 16129A 16148T 16168T 16172C 16187T 16188G 16189C 16223T 16230G 16278T! 16311C 16320T |
| AF243648 | L3 | African | Northeast | 16124C 16223T |
| AF243649 | L3 | African | Northeast | 16172C 16223T 16327T |
| AF243650 | L2 | African | Northeast | 150T! 182T! 16223T 16278T 16294T 16309G 16311C! |
| AF243651 | L3 | African | Northeast | 16129A 16209C 16223T 16292T 16295T 16311C |
| AF243652 | H1 | European | Northeast | 16309G |
| AF243653 | H1 | European | Northeast | 16362C |
| AF243654 | J | European | Northeast | 16069T 16126C |
| AF243655 | HV | European | Northeast | 16234T 16311C 16362C |
| AF243656 | H1 | European | Northeast | 16075C 16189C 16356C |
| AF243657 | K1 | European | Northeast | 16093C 16224C 16311C 16319A |
| AF243658 | H7 | European | Northeast | 16221T |
| AF243659 | K | European | Northeast | 16224C 16311C |
| AF243660 | H2 | European | Northeast | |
| AF243661 | H1 | European | Northeast | 16189C 16356C |
| AF243662 | T2 | European | Northeast | 16126C 16294T 16296T 16304C |
| AF243663 | H2 | European | Northeast | 16189C |
| AF243664 | V7 | European | Northeast | 16153A 16298C |
| AF243665 | H3 | European | Northeast | 16293G |
| AF243666 | L3 | African | Southeast | 750G! 16223T 16265T |
| AF243667 | L2 | African | Southeast | 16111A 16145A 16184T 16223T 16239T 16278T 16292T 16311C 16355T |
| AF243668 | L3 | African | Southeast | 750G! 16223T 16265T |
| AF243669 | L1 | African | Southeast | 15301A! 16086C 16129A 16187T 16189C 16223T 16241G 16274A 16278T 16291T 16293G 16294T 16311C 16360T |
| AF243670 | L3 | African | Southeast | 16185T 16209C 16223T 16327T |
| AF243671 | L2 | African | Southeast | 150T! 195C! 16223T 16224C 16278T 16311C! |
| AF243672 | L2 | African | Southeast | 16223T 16264T 16278T 16311C 16311C |
| AF243673 | L0 | African | Southeast | 73G! 146C! 182T! 185A! 195C! 263G! 15301A! 16129A! 16148T 16172C 16187T 16188G 16189C 16223T 16230G 16278T! 16311C 16320T |
| AF243674 | M5 | African | Southeast | 16223T 16278T 16294T |
| AF243675 | L1 | African | Southeast | 15301A! 16071T 16129A 16145A 16187T 16189C 16213A 16223T 16234T 16265C 16278T 16286G 16294T 16311C 16360T |
| AF243676 | U6 | African | Southeast | 16172C 16189C 16219G 16278T |
| AF243677 | U6 | African | Southeast | 16172C 16189C 16219G 16278T 16362C |
| AF243678 | L4 | African | Southeast | 5460A! 16223T 16293T 16311C 16355T 16362C |
| AF243679 | L2 | African | Southeast | 16114A 16129A 16213A 16223T 16278T 16311C! |
| AF243680 | L1 | African | Southeast | 1438G! 15301A! 16126C 16129A! 16187T 16189C 16223T 16264T 16270T 16278T 16311C |
| AF243681 | L4 | African | Southeast | 5460A! 16223T 16293T 16311C 16355T 16362C |
| AF243682 | X1 | African | Southeast | 16104T 16189C 16223T 16278T! |
| AF243683 | L1 | African | Southeast | 195C! 2283T! 7055G! 15301A! 16104T 16129A! 16163G 16187T 16189C 16223T 16278T 16293G 16294T 16311C 16360T |
| AF243684 | L3 | African | Southeast | 10398G! 16185T 16223T 16327T! |
| AF243685 | L0 | African | Southeast | 73G! 146C! 152C! 182T! 195C! 263G! 15301A! 16093C 16129A 16148T 16168T 16172C 16187T 16188G 16189C 16223T 16230G 16278T 16278T 16293G 16311C 16320T |
| AF243686 | L3 | African | Southeast | 16185T 16223T 16327T |
| AF243687 | L1 | African | Southeast | 15301A! 16129A 16187T 16189C 16223T 16278T 16293G 16294T 16311C 16360T |
| AF243688 | L1 | African | Southeast | 195C! 7055G! 15301A! 16129A 16163G 16187T 16189C 16209C 16223T 16278T 16293G 16294T 16311C 16360T |
| AF243689 | L1 | African | Southeast | 15301A! 16086C! 16129A! 16189C 16223T 16278T 16293G 16294T 16311C 16360T |
| AF243690 | L3 | African | Southeast | 16223T 16320T |
| AF243691 | L3 | African | Southeast | 16172C 16189C 16223T 16320T |
| AF243692 | M5 | African | Southeast | 16223T 16278T 16294T |
| AF243693 | L3 | African | Southeast | 16172C 16189C 16223T 16311C 16320T |
| AF243694 | L1 | African | Southeast | 198T! 10398G! 15301A! 16129A 16187T 16189C 16223T! 16278T 16293G 16294T 16311C 16360T |
| AF243695 | L3 | African | Southeast | 16172C 16189C 16223T 16320T |
| AF243696 | A+ | Amerindian/Asian | Southeast | 16189C 16223T 16290T 16319A 16362C |
| AF243697 | A2 | Amerindian/Asian | Southeast | 152C! 16097C 16098G 16111T 16223T 16290T 16319A 16362C |
| AF243698 | C1 | Amerindian/Asian | Southeast | 16223T 16325C 16327T |
| AF243699 | A2 | Amerindian/Asian | Southeast | 152C! 16111T! 16126C 16223T 16278T 16290T 16319A 16362C |
| AF243700 | A2 | Amerindian/Asian | Southeast | 152C! 16111T 16192T 16223T 16290T 16319A 16362C |
| AF243701 | A8 | Amerindian/Asian | Southeast | 16223T 16242T 16290T 16319A |
| AF243702 | B4 | Amerindian/Asian | Southeast | 16189C 16217C |
| AF243703 | B4 | Amerindian/Asian | Southeast | 16189C 16217C |
| AF243704 | G1 | Amerindian/Asian | Southeast | 16223T 16325C 16362C |
| AF243705 | C1 | Amerindian/Asian | Southeast | 16223T 16298C 16325C 16327T |
| AF243706 | A2 | Amerindian/Asian | Southeast | 152C! 16111T 16223T 16290T 16319A 16362C |
| AF243707 | B4 | Amerindian/Asian | Southeast | 16189C 16217C |
| AF243708 | C1 | Amerindian/Asian | Southeast | 16223T 16298C 16325C 16327T |
| AF243709 | B4 | Amerindian/Asian | Southeast | 16189C 16217C |
| AF243710 | A2 | Amerindian/Asian | Southeast | 152C! 16111T 16189C 16223T 16290T 16319A 16362C |
| AF243780 | H5 | European | South | 16304C |
| AF243781 | H1 | European | South | 16162G |
| AF243782 | U5 | European | South | 16144C 16189C 16192T! 16270T |
| AF243783 | T2 | European | South | 16126C 16153A 16294T 16296T |
| AF243784 | H2 | European | South | |
| AF243785 | J1 | European | South | 16069T 16126C 16261T |
| AF243786 | H2 | European | South | 16124C 16354T |
| AF243787 | H7 | European | South | 16213A |
| AF243788 | T2 | European | South | 16126C 16147T 16294T 16296T 16297C 16304C |
| AF243789 | H3 | European | South | 16093C |
| AF243790 | X2 | European | South | 16189C 16223T 16248T 16278T |
| AF243791 | HV | European | South | 16298C |
| AF243792 | U5 | European | South | 16189C 16192T! 16270T |
| AF243793 | K1 | European | South | 16224C 16311C 16319A |
| AF243794 | U7 | European | South | 16309G 16318T |
| AF243795 | J1 | European | South | 2706G! 16069T 16126C 16222T |
| AF243796 | R0 | European | South | 16126C 16362C |
| Genotype | Region: Northeast | Region: South | Region: Southeast |
|---|---|---|---|
| A+ | 1 | 1 | |
| A2 | 1 | 1 | |
| A8 | 1 | ||
| B2 | 1 | ||
| B4 | 1 | 1 | |
| C1 | 1 | 1 | |
| G1 | 1 | 1 | |
| H1 | 1 | 1 | |
| H2 | 1 | 1 | |
| H3 | 1 | 1 | |
| H5 | 1 | ||
| H7 | 1 | 1 | |
| HV | 1 | 1 | |
| J | 1 | ||
| J1 | 1 | ||
| K | 1 | ||
| K1 | 1 | 1 | |
| L0 | 1 | 1 | |
| L1 | 1 | 1 | |
| L2 | 1 | 1 | |
| L3 | 1 | 1 | |
| L4 | 1 | ||
| M4 | 1 | ||
| M5 | 1 | ||
| M7 | 1 | ||
| R0 | 1 | ||
| T2 | 1 | 1 | |
| U5 | 1 | ||
| U6 | 1 | ||
| U7 | 1 | ||
| V7 | 1 | ||
| X1 | 1 | ||
| X2 | 1 | ||
| #Total cases | 22 | 13 | 14 |
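A presence table of this kind can be derived from the per-sample table with base R. The following is a minimal sketch, not the authors' code. It uses only the 17 South samples (AF243780 to AF243796) copied from the sample table above, and the object names `south`, `counts`, `presence`, `n_haplogroups` and `incidence` are hypothetical.

```r
# Haplogroup calls for the 17 South samples (AF243780 to AF243796),
# copied from the sample table above.
south <- data.frame(
  accession = paste0("AF2437", 80:96),
  haplogroup = c("H5", "H1", "U5", "T2", "H2", "J1", "H2", "H7", "T2",
                 "H3", "X2", "HV", "U5", "K1", "U7", "J1", "R0"),
  region = "South",
  stringsAsFactors = FALSE
)

# Number of samples per haplogroup and region
counts <- table(south$haplogroup, south$region)

# Presence/absence matrix (1 = haplogroup observed in the region)
presence <- (counts > 0) * 1L

# Number of distinct haplogroups observed per region
n_haplogroups <- colSums(presence)

# Share of samples carrying each haplogroup within a region
incidence <- prop.table(counts, margin = 2)

print(n_haplogroups)
print(round(incidence[, "South"], 3))
```

For these South rows the sketch counts 13 distinct haplogroups, with H2, J1, T2 and U5 each observed twice. Adding the rows of the other regions to the same data frame would extend the calculation to a region-by-haplogroup table.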
These results suggest that the identity of self-declared white individuals in Brazil is not limited to those of strictly European matrilineal ancestry. This is potentially consistent with the ideology of racial democracy proposed by Gilberto Freyre. It also suggests that both individuals who self-identify as white and public policies should consider the genetic risk associated with African variants in this group. At this point it is not possible to establish cause and effect, but neuroblastoma mortality is predominant among self-declared white individuals, either because self-declared white individuals use the public health system more often than non-whites (evidence of structural racism) or because the genetic risk variants are also present in self-declared white individuals. We calculate an incidence of 32% for haplogroup L3 in the Northeast region (Table 2).
Haplogroup L3 is not present among self-declared white individuals in the South of Brazil. The observation of mitochondrial haplogroups of African ancestry in self-declared white Brazilians (Figure 7) is consistent with Fridman’s (2014) observation of 35% African matrilineal lineage in self-declared white individuals in Brazil. An estimated 80% European ancestry was calculated for the Y-chromosomal lineage of Brazilians. Using the weighted mean proportions technique, Souza et al. (2019) calculated European, African and Native American parental ancestries of 68.1%, 19.6% and 11.6%, respectively, for the Brazilian population. On the other hand, while haplogroups L1 and L3 may confer a greater risk of diseases of the nervous system on individuals identified as white, haplogroup K, of European origin, is associated with a reduced risk of neuroblastoma according to Chang et al. (2020). This suggests that Brazilians who self-declare as white and carry the European mitochondrial haplogroup K are less susceptible to neuroblastoma than individuals of African matrilineal lineage (Figure 6).
To apply the model described in Chaves et al. (2024) to the genomics of diseases of the nervous system, we propose a collaborative genomic research effort. We accessed neuroblastoma mortality data for Brazilians using the Microdatasus R package. We found a predominance of deaths of self-declared white individuals in the first decade of the 2000s (Figure 2). The predominance of white individuals in neuroblastoma mortality data in Brazil may be attributed to the ideology of “whitening” (Pena et al. 2011) of the Brazilian racial identity (Mitchell 2022), which is largely attributed to gender asymmetry and sexism in interracial relationships in Brazil (Pena 2007). After 2010, a decrease in the proportion of deaths of self-declared white individuals was observed (Figure 2). In Figure 2, we plot the mortality rate of neuroblastoma in Brazil using the previously saved dados_nb_appended_melted object and the code below.
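library(ggplot2)
library(magrittr)  # provides the %>% pipe
library(plotly)    # provides ggplotly()
# Load the melted neuroblastoma mortality table saved in an earlier step
nb_mortality_df <- readRDS("../../ReComBio Scientific/geraldo/data/dados_nb_appended_melted.rds")
# Bars normalized to 1: share of deaths per race/color category in each year
p_nb <- nb_mortality_df %>%
  ggplot(aes(x = year, y = value, fill = raca_cor_factor)) +
  geom_bar(position = "fill", stat = "identity")
# Convert the static plot into an interactive plotly widget
ggplotly(p_nb)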
Figure 3: Annual mortality due to neuroblastoma in a sample of the Brazilian population between 2000 and 2015, made available by the SUS Information Technology Department, DATASUS. Mortality is calculated in R using the Microdatasus package. Artwork by @allison_horst
Since the TET1/5-hmC hypomethylation genomic model proposed in Chaves et al. (2024) can only be applied to Brazilians after sequencing 5-hmC from cfDNA of Brazilian patients, the current minimum viable product (MVP) of the machine learning tool proposed in this project is a proof of concept that analyzes the mitochondrial DNA of Brazilians published by Alves-Silva et al. (2000). The predominance of self-declared white individuals in the neuroblastoma mortality data (Figure 3) is in line with the African ancestry (indicated in purple) of haplogroups L1 and L3 in self-declared white individuals in Brazil and with the higher risk of nervous system diseases associated with these haplogroups (Tranah et al. 2012; Tranah et al. 2014) (Figure 2).
Because we identified genetic variants associated with higher risk of diseases in the nervous system in the Brazilian mitochondrial sequences, we decided to look at the incidence of these mitochondrial haplogroups per large geographic region of Brazil, using the geographic information present in the Alves-Silva et al. (2000) study. To do that we accessed geographic information about the neuroblastoma death rates of the self-identifying racial groups in the geographic regions of Brazil.
We then loaded the neuroblastoma mortality by race:
data_nb_2013_estados_perct <- readRDS("../../ReComBio Scientific/geraldo/data/data_nb_2013_estados_perct.rds")
Then we loaded geographic (state) data using the read_state function from the geobr R package as follows:
library(geobr)
# read all states
states <- read_state(
  year = 2019,
  showProgress = FALSE
)
To integrate genomic and geographic information, we joined the states data frame (from geobr) with the data_estados_brancos_perct data frame (total mortality from Microdatasus). A second join of the same kind involved the states_nb_2013 data (neuroblastoma mortality from Microdatasus).
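The join described above can be sketched with base R. This is an illustration under stated assumptions, not the authors' code: the toy tables `states_toy` and `mortality_toy` and all values in them are hypothetical, the key columns `abbrev_state` and `UF` are assumed to hold matching two-letter state codes, and the real object returned by geobr is an sf data frame that also carries a geometry column.

```r
# Toy stand-in for the state table; the real geobr object also has geometry.
states_toy <- data.frame(
  abbrev_state = c("BA", "SP", "RS"),
  name_state = c("Bahia", "Sao Paulo", "Rio Grande do Sul"),
  stringsAsFactors = FALSE
)

# Hypothetical per-state mortality shares (values invented for illustration)
mortality_toy <- data.frame(
  UF = c("SP", "RS"),
  branca_perc = c(0.60, 0.85),
  stringsAsFactors = FALSE
)

# Left join: keep every state and attach mortality where it is available.
# States without mortality data receive NA.
states_joined <- merge(states_toy, mortality_toy,
                       by.x = "abbrev_state", by.y = "UF",
                       all.x = TRUE)
print(states_joined)
```

Keeping all rows of the state table means that states without records still appear on a map, drawn with a missing value instead of being dropped.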
Because the main ethnic-racial group affected by neuroblastoma mortality in Brazil was the self-reported white group, we began exploring the spatial information in the demographic data stored by DATASUS by visualizing the mortality rate of self-declared white individuals in the states of Brazil (data not shown). We obtained the number of self-declared white individuals in each state. With that number, we can estimate the number of self-declared white individuals reported as deceased by the SUS health system in 2014. As expected, we observed that proportionally more self-reported white individuals died in Southern Brazil than in any other large geographic region (data not shown).
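Per-state shares by race/color category can be computed from record-level data as sketched below. This is a minimal sketch and not the authors' computation, which is not shown. The records are invented, the denominator (all deaths with a recorded category in the state) is an assumption, and the mapping of RACACOR codes 1 to 5 onto branca, preta, amarela, parda and indigena is assumed here and should be checked against the SIM data dictionary.

```r
# Hypothetical death records: one row per death, with state (UF) and the
# RACACOR code. Values are invented for illustration.
records <- data.frame(
  UF = c("SP", "SP", "SP", "SP", "BA", "BA"),
  RACACOR = c(1, 1, 2, 4, 4, 1),
  stringsAsFactors = FALSE
)

# Assumed code-to-label mapping (verify against the SIM data dictionary)
race_labels <- c("branca", "preta", "amarela", "parda", "indigena")
records$race <- factor(race_labels[records$RACACOR], levels = race_labels)

# Deaths per state and category, then row-wise shares within each state.
# Records with a missing RACACOR are dropped by table().
counts <- table(records$UF, records$race)
shares <- prop.table(counts, margin = 1)

# One row per state, with columns such as branca_perc and preta_perc
perc <- as.data.frame.matrix(shares)
names(perc) <- paste0(names(perc), "_perc")
perc$UF <- rownames(perc)
print(perc)
```

The resulting column names follow the pattern of the branca_perc, preta_perc and amarela_perc columns used in the maps, so a table of this shape could be joined to the state table by UF.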
Then we plotted the neuroblastoma mortality rate among self-declared white individuals in Brazil:
ggplot() +
geom_sf(data=states_nb_2013, aes(fill=branca_perc),
color= "black", size=.15) + ## Color here is the line of the border
labs(subtitle="", size=8) +
scale_fill_distiller(palette = "Reds", name="Mortality\nRate", direction=+1,
limits = c(0,1)) +
theme_minimal() #+no_axis
Figure 4: Mortality rates of self-declared white individuals in the states of the large regions of Brazil. This should be contrasted with the neuroblastoma and neurodegeneration mortality rates in the same states to evaluate whether mortality rates are equal for all ethnic-racial groups.
We observe that the states of Amazonas and Tocantins in the North, Maranhão in the Northeast and Rio Grande do Sul in the South show the highest incidence of death of white children from neuroblastoma.
We now plot the neuroblastoma mortality rate among self-declared black individuals in Brazil:
ggplot() +
geom_sf(data=states_nb_2013, aes(fill=preta_perc),
color= "black", size=.15) + ## Color here is the line of the border
labs(subtitle="", size=8) +
scale_fill_distiller(palette = "Reds", name="Mortality\nRate", direction=+1,
limits = c(0,0.3)) +
theme_minimal() #+no_axis
Figure 5: Mortality rates of self-declared black individuals in the states of the large regions of Brazil. This should be contrasted with the neuroblastoma and neurodegeneration mortality rates in the same states to evaluate whether mortality rates are equal for all ethnic-racial groups.
We see that Piauí and Espírito Santo show the highest incidence of death of black children from neuroblastoma.
Finally, we plot the neuroblastoma mortality rate among self-declared yellow individuals in Brazil:
ggplot() +
geom_sf(data=states_nb_2013, aes(fill=amarela_perc),
color= "black", size=.15) + ## Color here is the line of the border
labs(subtitle="", size=8) +
scale_fill_distiller(palette = "Reds", name="Mortality\nRate", direction=+1,
limits = c(0,1)) +
theme_minimal() #+no_axis
Figure 6: Mortality rates of self-declared yellow individuals in the states of the large regions of Brazil. This should be contrasted with the neuroblastoma and neurodegeneration mortality rates in the same states to evaluate whether mortality rates are equal for all ethnic-racial groups.
We see that the states of Alagoas, Pará and Piauí show the highest incidence of death of yellow children from neuroblastoma.
Having established where the mortality rate of self-reported white individuals is highest, we begin assessing the incidence of the mitochondrial haplogroups that confer protection from or risk of diseases of the nervous system. Table 3 shows the incidence of each haplogroup identified in the mitochondrial sequences isolated from self-declared white Brazilians. The Region column in that table can be used to merge the genotype information with the demographic information contained in the following table, extracted using the Microdatasus R package.
dados_estados_nb_2013 <- readRDS("../../R/R journal/data/dados_estados_nb_2013.rds")
dim(dados_estados_nb_2013)
[1] 302 8
# kable(dados_estados_nb_2013, caption="dados_estados_nb_2013") %>%
# kable_styling("striped", full_width = F, font_size = 12) %>%
# scroll_box(width = "100%", height = "600px")
head(dados_estados_nb_2013)
CONTADOR RACACOR DTOBITO CAUSABAS DTNASC IDADE UF Region
6222 6222 1 23022013 C749 30122003 409 TO North
31026 24155 1 16022013 C749 09022012 401 SP Southeast
44075 37204 <NA> 09042013 C749 25102009 403 SP Southeast
44097 37226 2 08082013 C749 03042012 401 SP Southeast
46745 39874 1 10012013 C749 22031953 459 SP Southeast
48051 41180 1 19012013 C749 02032011 401 SP Southeast
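The date and age fields printed above can be decoded as sketched below. This is a minimal sketch using three of the printed records. It assumes that DTOBITO and DTNASC are ddmmyyyy strings and that an IDADE code starting with 4 stores the age in years in its last two digits. Both assumptions are consistent with the rows shown but should be checked against the SIM documentation. The helper `completed_years` is hypothetical.

```r
# Three records copied from the head() output above
rec <- data.frame(
  DTOBITO = c("23022013", "16022013", "10012013"),
  DTNASC = c("30122003", "09022012", "22031953"),
  IDADE = c("409", "401", "459"),
  stringsAsFactors = FALSE
)

# Dates are assumed to be stored as ddmmyyyy strings
rec$death_date <- as.Date(rec$DTOBITO, format = "%d%m%Y")
rec$birth_date <- as.Date(rec$DTNASC, format = "%d%m%Y")

# Completed years between birth and death (hypothetical helper)
completed_years <- function(birth, death) {
  b <- as.POSIXlt(birth)
  d <- as.POSIXlt(death)
  years <- d$year - b$year
  # Subtract one year when the birthday had not yet been reached
  before_birthday <- (d$mon < b$mon) | (d$mon == b$mon & d$mday < b$mday)
  years - as.integer(before_birthday)
}
rec$age_years <- completed_years(rec$birth_date, rec$death_date)

# Assumption: codes starting with "4" give the age in years in digits 2 and 3
rec$idade_years <- ifelse(substr(rec$IDADE, 1, 1) == "4",
                          as.integer(substr(rec$IDADE, 2, 3)),
                          NA_integer_)

print(rec[, c("DTOBITO", "DTNASC", "age_years", "idade_years")])
```

For these three records the age computed from the dates agrees with the age decoded from IDADE (9, 1 and 59 years), which supports the assumed encoding for codes of this form.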
Note that both Table 3 and dados_estados_nb_2013 have a column named Region, depicting the large regions of Brazil. Similar columns can be used to store information about the state, municipality, city and health unit serving the patient in the public health system.
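A merge on the shared Region column can be sketched as follows. This is an illustration only: `deaths` reuses the Region values of the six records printed above, `haplo` is a hypothetical long-format excerpt with three haplogroup and region pairs taken from the sample table, and the object names are not from the original analysis.

```r
# Region values of the six death records printed above
deaths <- data.frame(
  Region = c("North", "Southeast", "Southeast",
             "Southeast", "Southeast", "Southeast"),
  stringsAsFactors = FALSE
)

# Number of death records per region
deaths_by_region <- as.data.frame(table(Region = deaths$Region),
                                  responseName = "n_deaths",
                                  stringsAsFactors = FALSE)

# Hypothetical long-format haplogroup excerpt keyed by the same Region labels
haplo <- data.frame(
  Region = c("Southeast", "Southeast", "South"),
  haplogroup = c("L3", "B4", "K1"),
  stringsAsFactors = FALSE
)

# Attach the regional death counts to each haplogroup row.
# Regions without death records in this excerpt receive NA.
merged <- merge(haplo, deaths_by_region, by = "Region", all.x = TRUE)
print(merged)
```

The same pattern would apply to finer keys such as state or municipality, provided both tables use identical labels for the key column.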
After estimating the mortality of individuals who identify as white per region of Brazil (Figure 4), we begin estimating the incidence of the mitochondrial haplogroups by geographic region. Toparslan et al. (2020) proposed an R workflow for phylogenetic analysis and visualization of mitochondrial sequences. Chang et al. (2020), Tranah et al. (2012), Tranah et al. (2014), Feder et al. (2008) and Kofler et al. (2009) have reported that these haplogroups are associated with the genetic risk of diseases of the nervous system. We now estimate which mitochondrial lineages are exposed to the highest risks across the territory of Brazil.
Haplogroup J is predominant in Southern Brazil, with a significant presence in the Northeast. It was reported by Tranah et al. (2012) to be associated with cognitive impairment. Feder et al. (2008) found that this haplogroup is also associated with type 2 diabetes in Ashkenazi Jews, and we identified (Chaves et al. 2019) a genetic mechanism that makes this disease more frequent in people who have neurodegenerative diseases.
Figure 7: Estimation of incidence of mitochondrial haplogroup J in populations of the large regions of Brazil.
Haplogroup K was found to be predominant in Southern and Northeastern Brazil in this study. Chang et al. (2020) reported that haplogroup K protects against neuroblastoma, including high-risk neuroblastoma, the most aggressive form of the disease. It is possible that this haplogroup is associated with an increased inflammatory response and T-cell infiltration in hot neuroblastoma tumors via mitochondrial reprogramming of metabolism in cancer and immune cells. Of note, we identified the highest incidence of haplogroup K in the Brazilian population of the Northeast, the region nationally known for the arrival of the Portuguese in 1500 in Porto Seguro, Bahia. Considering that the recent Italian immigration to the South and Southeast accounts for a considerable share of European immigration to Brazil, it is unexpected that this protective haplogroup is so prevalent in the Northeast. One possible explanation is the immigration of people from the Netherlands to the state of Pernambuco during the Dutch invasions of Brazil.
Figure 8: Estimation of incidence of mitochondrial haplogroup K in populations of the large regions of Brazil.
Haplogroup L3 is predominant in Southeastern and Northeastern Brazil. It was reported by Tranah et al. (2014) to be associated with cognitive impairment in African Americans. It is possible that this haplogroup is associated with increased mortality from neuroblastoma in self-declared white Brazilians of African mitochondrial ancestry.
Figure 9: Estimation of incidence of mitochondrial haplogroup L3 in populations of the large regions of Brazil.
Haplogroup T was detected in samples from Northeastern and Southern Brazil. According to a study by Kofler et al. (2009), mitochondrial DNA haplogroup T is associated with coronary artery disease and diabetic retinopathy. Chang et al. (2020) also reported an association between haplogroup T and neuroblastoma.
Figure 10: Estimation of incidence of mitochondrial haplogroup T in populations of the large regions of Brazil.
The Extreme South of Bahia is where Portuguese sailors arrived in Brazil in 1500 and is therefore the oldest region occupied by Portugal in the Americas. The largest cities in the region today are Teixeira de Freitas, Porto Seguro and Eunápolis. Itamaraju is the city where Mount Pascoal is located. The School of Medicine of the Universidade Federal do Sul da Bahia (UFSB) is located in Teixeira de Freitas, while the computational biology group is based in Porto Seguro. The Extreme South of Bahia has been occupied since the arrival of the Portuguese sailors, with migrations from the captaincies of Bahia de Todos os Santos, Minas do Ouro (Minas Gerais) and Espírito Santo. Because of slavery and the ideology of whitening in Brazil, genealogical records of the occupation of the Extreme South of Bahia by black and Native Brazilians are scarce. Genealogical records of white or Portuguese occupation are documented, such as those of the Tourinho family, to which the first donatário of the Captaincy of Porto Seguro belonged. The occupation of the cities of Alcobaça and Caravelas, in the south of the Captaincy of Porto Seguro, is well documented in books written by Fábio Said (Said 2024). In Teixeira de Freitas there is a record of the Cascata Farm, founded in the 19th century and titled in 1891. According to the City Council of Teixeira de Freitas (2017) and Teixeira de Freitas (2016), the Cascata Farm was founded by Joaquim Muniz de Almeida, great-great-great-grandfather (tetravô) of José Sérgio Figueiredo. Borborema (2019) provides extensive documentation on the genealogical record of the titling of the Cascata Farm in Teixeira de Freitas. These references also record Italian surnames currently found in the Extreme South of Bahia. The family descends from a Portuguese settler who came from the island of São Miguel in the Azores to Alcobaça in Bahia.
Joaquim Muniz, already Brazilian-born and a grandson of this first Portuguese settler, João José de Medeiros, who arrived in Alcobaça in 1780, went up the Itanhém River, took possession of unclaimed state lands (terras devolutas) and started growing coffee and cassava with 20 to 30 workers. The farm passed from Joaquim Muniz to Joaquim Muniz de Almeida Filho and then to Quincas Neto, who drove the development of the farm through colonization efforts during the opening of the Extreme South in the mid-20th century. In the time of Joaquim Muniz de Almeida Neto, more than 40 families lived there. The school came first and the church afterwards: by 1930 there was already a school at the Cascata Farm. The teacher Carmélia came from Salvador to teach at the farm, which hosted a state school that remained active until the 1980s. The white element in Brazil is largely Portuguese. However, because of the Bandeiras Paulistas in Minas Gerais and Espírito Santo, an Italian presence can also be noticed in the Extreme South of Bahia (Borborema 2019). For example, it can be seen in names such as Maria Cristina Dal Monte Figueiredo at the Cascata Farm and Kária Armini. Italian names are not very common in the Extreme South of Bahia and come mainly from Espírito Santo, whereas Portuguese names come from Bahia, Minas Gerais and Espírito Santo.
Construction of genetic risk databases can help decision-makers identify regions and individuals with higher genetic risk earlier, contributing to informed use of financial resources in the public health system.
Machine learning algorithms and artificial intelligence can contribute to improving public health policies by using large-scale datasets to identify population groups at greater risk of disease.
Mitochondrial genetic variants inform the risk of nervous system diseases such as neuroblastoma and neurodegenerative diseases.
We show that Brazilians of African mitochondrial matrilineal ancestry carry variants associated with increased risk of neurodegenerative diseases and do not carry protective neuroblastoma variants.
Mitochondrial sequences are simple enough to be integrated with public databases such as that of the SUS IT department (DATASUS). They allow population stratification by genomic ancestry estimation and serve as a starting point for the development of machine learning and artificial intelligence tools using omics data for the SUS health system.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Chaves, et al., "GERALDA: A framework for integration of genomic information into the database of the Brazilian Health System", The R Journal, 2024
BibTeX citation
@article{quokka-bilby,
author = {Chaves, Gepoliano and Ramos, Pablo Ivan and Dutra, Juliana and Gonçalves, Marilda},
title = {GERALDA: A framework for integration of genomic information into the database of the Brazilian Health System},
journal = {The R Journal},
year = {2024},
issn = {2073-4859},
pages = {1}
}